Abstract Sparse coding (SC) has been well studied as a powerful data
representation method. It represents the feature vector of a data sample by
reconstructing it as a sparse linear combination of some basic elements, and
an L2 norm distance function is usually used as the loss function for the
reconstruction error. In this paper, we investigate using SC as the
representation method within the multi-instance learning framework, where a
sample is given as a bag of instances and further represented as a histogram
of the quantized instances. We argue that the L2 norm distance is not
suitable for histogram data, and propose to use the earth mover’s distance
(EMD) instead as the measure of the reconstruction error. By minimizing the
EMD between the histogram of a sample and its reconstruction from some basic
histograms, a novel sparse coding method is developed, which is referred to
as SC-EMD. We evaluate its performance as a histogram representation method
in two multi-instance learning problems: abnormal image detection in
wireless capsule endoscopy videos, and protein binding site retrieval. The
encouraging results demonstrate the advantages of the new method over the
traditional method using the L2 norm distance.

1 Introduction

Sparse Coding (SC) has recently been proposed and well studied as an
effective data representation method in the machine learning community.
Given a set of basic elements and a data sample, SC tries to represent the
sample by reconstructing it as a linear combination of these basic elements.
The coefficient vector of the linear combination is then used as the new
representation of the sample. To this end, the basic elements and the
coefficient vector (also called the coding vector) are learned by minimizing
the reconstruction error, while the coding vector is also encouraged to be
as sparse as possible. To measure the reconstruction error, a squared L2
norm distance between the original feature vector and its sparse linear
combination is usually used as the loss function, and an L1 norm
regularization term is imposed on the coding vector to promote its sparsity.
The advantage of combining the L2 norm loss with the L1 norm regularization
is that the resulting problem is easy to optimize and interpret. The
feature-sign search method was proposed by Lee et al. to solve the SC
problem. Several SC variants have been proposed since then by adding
different bias terms to the original SC loss function based on the L2 norm
distance.

In the multi-instance learning framework, each sample is given as a bag of
multiple instances, instead of a single instance as in traditional machine
learning problems. For example, in image classification and retrieval
problems, an image can be split into many small image patches, and each
patch is an instance. In this case, we usually first learn a set of instance
prototypes by clustering the instances of the training samples, and then
represent a sample by quantizing its instances into the instance prototypes
to obtain a quantization histogram. The normalized histogram is used as the
feature vector of the sample for the subsequent classification or retrieval
task. When we apply SC to represent histogram data under the multi-instance
learning framework, directly using the L2 norm distance may no longer be
suitable, and a distance function designed for histogram data is desired. In
fact, many distance functions have been studied for histogram comparison,
such as the Kullback-Leibler divergence, the χ² distance, the Earth Mover’s
Distance (EMD), etc. Among these distance functions, EMD is known to
quantify the errors in histogram comparison better than the other distance
metrics, because it takes the ground distance between bins into account.
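
As a small illustration of why a cross-bin distance suits histograms, the sketch below compares the L2 distance with a one-dimensional EMD. It assumes unit-mass histograms over ordered bins with ground distance |i - j|, in which case the EMD equals the sum of absolute differences of the cumulative sums. The function names `l2_distance` and `emd_1d` are illustrative and do not come from the paper.

```python
import math

def l2_distance(p, q):
    """Euclidean (L2) distance between two histograms of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def emd_1d(p, q):
    """EMD for 1-D histograms with equal mass and ground distance |i - j|.

    Assumption of this sketch: bins are ordered on a line, so the mass that
    must cross each bin boundary is the running difference of the two
    cumulative sums.
    """
    total, carry = 0.0, 0.0
    for a, b in zip(p, q):
        carry += a - b        # surplus mass still to be moved to the right
        total += abs(carry)   # work spent crossing this bin boundary
    return total

ref = [1.0, 0.0, 0.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0, 0.0, 0.0]   # peak shifted by one bin
far = [0.0, 0.0, 0.0, 0.0, 1.0]    # peak shifted by four bins

print(l2_distance(ref, near), l2_distance(ref, far))  # identical values
print(emd_1d(ref, near), emd_1d(ref, far))            # grows with the shift
```

The L2 distance cannot tell a one-bin shift from a four-bin shift, while the EMD grows with the size of the shift.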

In this paper, we propose the first SC method with the EMD metric for the
representation of histogram data. Instead of the L2 norm distance, we use
the EMD to construct the loss function of the sparse coding problem. The
newly proposed method, SC-EMD, is especially suitable for the representation
of histogram data under the multi-instance learning framework.

The rest of this paper is organized as follows: the formal objective
function of SC-EMD, the linear programming-based optimization, and an
iterative learning algorithm are presented in Section 2; experiments with
three multi-instance learning tasks are presented in Section 3; and
conclusions are given in Section 4.


2 Sparse Coding with Earth Mover’s Distance 

In this section, we introduce the novel sparse coding method that uses EMD
as the distance metric, instead of the traditional squared L2 norm distance,
for the representation of histogram data.


2.1 Objective function 

Assume that we have a training set of N data samples. We represent each
data sample as a bag of multiple instances under the framework of
multi-instance learning. To extract the feature vector of the n-th sample,
we quantize its instances into a set of instance prototypes and use the
quantization histogram as the feature vector. We assume that the number of
instance prototypes is D, thus the feature vector of the n-th sample is a
normalized D-dimensional histogram, denoted as
$x_n = [x_{n1}, \cdots, x_{nD}]^\top \in \mathbb{R}^D$, where $x_{ni}$ is
the i-th bin of the histogram. Note that $x_n$ is normalized as
$\sum_{i=1}^{D} x_{ni} = 1$, so that it is a distribution. The set of the
histograms of all the training samples is denoted as
$X = \{x_1, \cdots, x_N\}$, where $x_n$ is the histogram of the n-th sample.
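
The quantization step described above can be sketched as follows. This is a minimal illustration that assumes each instance is assigned to its nearest prototype under the squared Euclidean distance, which the text does not specify. The function name and the toy data are hypothetical.

```python
def quantization_histogram(bag, prototypes):
    """Return the normalized histogram of nearest-prototype assignments.

    bag: list of instance feature tuples (assumed non-empty).
    prototypes: list of D prototype feature tuples.
    """
    counts = [0] * len(prototypes)
    for instance in bag:
        # Squared Euclidean distance from this instance to every prototype.
        dists = [sum((a - b) ** 2 for a, b in zip(instance, proto))
                 for proto in prototypes]
        counts[dists.index(min(dists))] += 1
    # Normalize so that the bins sum to one (a distribution).
    return [c / len(bag) for c in counts]

prototypes = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]             # D = 3
bag = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.7, 5.2)]       # four instances
x_n = quantization_histogram(bag, prototypes)
print(x_n)
```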

Under the framework of sparse coding, we try to represent each histogram in
$X$ as a sparse linear combination of a set of basic histograms. We denote
the set of basic histograms as $U = \{u_1, \cdots, u_M\}$, where
$u_m = [u_{m1}, \cdots, u_{mD}]^\top \in \mathbb{R}^D$ is the m-th basic
histogram, and M is the number of basic histograms. Similar to $x_n$, $u_m$
is also normalized as $\sum_{i=1}^{D} u_{mi} = 1$. The basic histograms are
further organized as a basic matrix
$U = [u_1, \cdots, u_M] \in \mathbb{R}^{D \times M}$. With the basic
histograms, we try to reconstruct each $x_n$ as the weighted linear
combination of these basic histograms, as

$$
x_n \approx \sum_{m=1}^{M} v_{nm} u_m = U v_n,
\tag{1}
$$

where $v_n = [v_{n1}, \cdots, v_{nM}]^\top \in \mathbb{R}^M$ is the
reconstruction coefficient vector, which is also called the coding vector of
$x_n$, and $v_{nm}$ is the coefficient of the m-th basic histogram for the
reconstruction of the histogram of the n-th sample. Similarly, the sparse
coding vectors of all the samples in $X$ are organized as a coding matrix
$V = [v_1, \cdots, v_N] \in \mathbb{R}^{M \times N}$, with the n-th column
being the coding vector of the n-th sample. Given the histogram $x_n$ of the
n-th sample, the target of the sparse coding problem is to learn a basic
histogram matrix $U$ and a sparse coding vector $v_n$, so that the original
$x_n$ and its reconstruction $U v_n$ are as close to each other as possible,
i.e., the reconstruction error is minimized. At the same time, we also
expect the coding vector $v_n$ to be as sparse as possible. To this end, we
discuss the following two issues to build the objective function for the
learning of $U$ and $V$.

Reconstruction Error To measure the reconstruction error between $x_n$ and
$U v_n$, traditional sparse coding methods have used the squared L2 norm
distance,

$$
L_2(x_n, U v_n) = \|x_n - U v_n\|_2^2,
\tag{2}
$$

for the learning of $U$ and $V$. The objective function is built by applying
the squared L2 norm distance to all samples in the training set. However, as
discussed in the introduction, the L2 norm distance is unsuitable for
histogram data. In this work, we instead apply the EMD, which has been a
popular metric for histogram data, as the distance measure between $x_n$ and
$U v_n$. To define the EMD between two histograms $x_n$ and $U v_n$, we
treat each bin of $x_n$ as a supply and each bin of $U v_n$ as a demand. We
also denote $d_{ij}$ as the ground distance from the i-th supply to the j-th
demand. The EMD between $x_n$ and $U v_n$ is defined as the minimum amount
of work needed to fill all the demands with all the supplies,

$$
\begin{aligned}
EMD(x_n, U v_n) = \min_{F^n} \; & \sum_{i,j} f^n_{ij} d_{ij} \\
\text{s.t.} \; & f^n_{ij} \ge 0, \;\; \sum_{j} f^n_{ij} \le x_{ni}, \;\;
  \sum_{i} f^n_{ij} \le \sum_{m} u_{mj} v_{nm},
\end{aligned}
\tag{3}
$$

where the variable $f^n_{ij}$ denotes the amount transported from the i-th
supply to the j-th demand for the n-th sample, and
$F^n = [f^n_{ij}] \in \mathbb{R}^{D \times D}$ is the matrix of the
transported amounts. The constraint $f^n_{ij} \ge 0$ prevents negative
transportation. The constraint $\sum_j f^n_{ij} \le x_{ni}$ means that the
mass moved out of the i-th supply should not be larger than $x_{ni}$, while
$\sum_i f^n_{ij} \le \sum_m u_{mj} v_{nm}$ means that the mass moved into
the j-th demand should not be larger than $\sum_m u_{mj} v_{nm}$. The
problem in (3) can be solved as a Linear Programming (LP) problem.
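
To make the LP form of (3) concrete, the following sketch solves it with SciPy's `linprog` for two fixed histograms. The constraints listed above only bound the flows from above, so this sketch additionally assumes, as in the standard transportation formulation of EMD, that the total flow equals the smaller of the two total masses. That equality is an assumption of the example and is not stated in the text. The function name `emd_lp` is illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def emd_lp(x, y, d):
    """EMD between histograms x (supplies) and y (demands), ground distance d[i, j]."""
    D = len(x)
    cost = d.ravel()                        # F flattened row-major: f_ij -> i * D + j
    A_ub = np.zeros((2 * D, D * D))
    for i in range(D):
        A_ub[i, i * D:(i + 1) * D] = 1.0    # row i of F:    sum_j f_ij <= x_i
        A_ub[D + i, i:D * D:D] = 1.0        # column i of F: sum_i' f_i'i <= y_i
    b_ub = np.concatenate([x, y])
    # Assumption (not in the text): ship as much mass as possible;
    # otherwise F = 0 would trivially minimize the cost.
    A_eq = np.ones((1, D * D))
    b_eq = [min(x.sum(), y.sum())]
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(D, D)

D = 3
d = np.abs(np.subtract.outer(np.arange(D), np.arange(D))).astype(float)
x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 0.0, 1.0])
work, F = emd_lp(x, y, d)                   # all mass travels two bins
work_shift, _ = emd_lp(np.array([0.5, 0.5, 0.0]), np.array([0.0, 0.5, 0.5]), d)
print(work, work_shift)
```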

Sparsity Regularization To encourage the sparsity of each coding vector
$v_n$, traditional sparse coding approaches impose an L1 norm based sparsity
penalty on $v_n$ as

$$
\min_{v_n} \left\{ L_1(v_n) = \|v_n\|_1 = \sum_{m=1}^{M} |v_{nm}| \right\}.
\tag{4}
$$

Using the L1 norm sparsity penalty, most of the elements of $v_n$ are driven
to zero, and only a few of them are kept for the reconstruction of $x_n$.

Direct optimization of this regularization term is difficult, because it is
non-smooth. We follow the works of Fan et al., which were proposed to
improve the lower bounds for Bayesian network structure learning, and
propose to optimize an upper bound of the L1 norm regularization. Fan et al.
proposed a method to tighten the upper and lower bounds of the learning
problem of the Bayesian network structure. In the work of Fan et al. [23],
more informed variable groupings are used to create the pattern databases
for the tightening of the lower bounds, and an anytime learning algorithm is
used for the tightening of the upper bounds. Moreover, Fan et al. proposed a
new partition method that uses the information extracted from the potential
optimal parent sets to improve the lower bound for Bayesian network
structure learning. Inspired by these works, to solve the problem together
with the LP problem of EMD, instead of minimizing the L1 norm of the code
vector $v_n$ directly, we introduce a slack vector as an upper bound and
minimize the L1 norm of the slack vector. Specifically, we introduce a
nonnegative slack vector
$s_n = [s_{n1}, \cdots, s_{nM}]^\top \in \mathbb{R}^M$ for each code vector
$v_n$ as the upper bound of its element-wise absolute value,
$s_n \ge |v_n| \ge 0$, where
$|v_n| = [|v_{n1}|, \cdots, |v_{nM}|]^\top \in \mathbb{R}^M$, and then
minimize the L1 norm of the slack vector to seek the sparsity of $v_n$
indirectly. Because $s_n$ is a nonnegative vector, its L1 norm can be
computed simply as the summation of its elements,
$\|s_n\|_1 = \sum_{m=1}^{M} s_{nm}$. To seek the sparsity of $v_n$, we have
the following optimization problem,

$$
\min_{v_n, s_n} \left\{ L_1(s_n) = \sum_{m=1}^{M} s_{nm} \right\},
\quad \text{s.t.} \;\; s_n \ge |v_n| \ge 0.
\tag{5}
$$

We also organize the upper bound vectors of the sparse codes as an upper
bound matrix $S = [s_1, \cdots, s_N] \in \mathbb{R}^{M \times N}$, where the
n-th column is the upper bound vector of the n-th sparse code vector. By
using the slack vectors, we turn the sparsity regularization term into a
smooth (linear) function, which can be integrated into the optimization
problem of EMD naturally and solved as an LP problem easily.
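
The step from (4) to (5) does not change the value of the penalty, as the short derivation below sketches. It uses only the definitions above, with $v_n$ held fixed for the inner minimization.

```latex
% For a fixed code vector v_n, the constraint s_n >= |v_n| decouples over m:
\min_{s_n \ge |v_n|} \sum_{m=1}^{M} s_{nm}
  = \sum_{m=1}^{M} \min_{s_{nm} \ge |v_{nm}|} s_{nm}
  = \sum_{m=1}^{M} |v_{nm}|
  = \|v_n\|_1 .
% Each bound s_{nm} >= |v_{nm}| is equivalent to two linear inequalities,
%   -s_{nm} <= v_{nm} <= s_{nm},
% so the L1 penalty can be carried by linear constraints inside an LP.
```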


By applying both the EMD based reconstruction error term in (3) and the
sparsity regularization term in (5) to each training sample in $X$, and
summing them up, we have the following objective function for the EMD based
sparse coding problem,

$$
\begin{aligned}
\min_{U, V, S} \; & \sum_{n=1}^{N} \left( EMD(x_n, U v_n)
  + \gamma \|s_n\|_1 \right) \\
\text{s.t.} \; & u_{mj} \ge 0, \;\; \sum_{j} u_{mj} = 1, \;\;
  s_n \ge |v_n| \ge 0,
\end{aligned}
\tag{6}
$$

where $\gamma$ is a trade-off parameter, and the constraints
$u_{mj} \ge 0$ and $\sum_j u_{mj} = 1$ are imposed on the basic histograms
to guarantee that the learned basic histograms are normalized distributions.
Please note that $EMD(x_n, U v_n)$ itself is also obtained by solving a
minimization problem with regard to $F^n$. We substitute (3) and (5) into
(6), so that the optimization problem in (6) is extended into a
parameter-enlarged optimization problem with the transported amount matrices
$F^1, \cdots, F^N$ as additional parameters,

$$
\begin{aligned}
\min_{U, V, S, F^1, \cdots, F^N} \; & \sum_{n=1}^{N} \left(
  \sum_{i,j} f^n_{ij} d_{ij} + \gamma \sum_{m=1}^{M} s_{nm} \right) \\
\text{s.t.} \; & u_{mj} \ge 0, \;\; \sum_{j} u_{mj} = 1, \\
& f^n_{ij} \ge 0, \;\; \sum_{j} f^n_{ij} \le x_{ni}, \;\;
  \sum_{i} f^n_{ij} \le \sum_{m} u_{mj} v_{nm}, \\
& -s_{nm} \le v_{nm} \le s_{nm}, \;\; s_{nm} \ge 0.
\end{aligned}
\tag{7}
$$

This problem is a parameter-enlarged LP problem.


2.2 Optimization 

Directly optimizing the objective of (7) is difficult and time-consuming.
Similar to the original L2 norm-based sparse coding method, we adopt an
alternate optimization strategy for the learning of U and (V, S) in an
iterative algorithm. In each iteration, one of U and (V, S) is optimized
while the other one is fixed, and then their roles are switched. The
iterations are repeated until a maximum iteration number is reached.

2.2.1 Optimizing (V,S) while fixing U 

By fixing $U$, we can optimize the coding vectors in $V$ together with the
other additional variables. Similar to the traditional sparse coding
methods, we update each sparse coding vector individually. When the coding
vector $v_n$ of the n-th sample and its slack vector $s_n$ are being
optimized, the other ones ($v_{n'}$, $n' \ne n$) with their corresponding
additional variables ($F^{n'}$ and $s_{n'}$) are fixed. Thus, the
optimization problem in (7) is reduced to

$$
\begin{aligned}
\min_{v_n, s_n, F^n} \; & \sum_{i,j} f^n_{ij} d_{ij}
  + \gamma \sum_{m=1}^{M} s_{nm} \\
\text{s.t.} \; & f^n_{ij} \ge 0, \;\; \sum_{j} f^n_{ij} \le x_{ni}, \;\;
  \sum_{i} f^n_{ij} \le \sum_{m} u_{mj} v_{nm}, \\
& -s_{nm} \le v_{nm} \le s_{nm}, \;\; s_{nm} \ge 0,
\end{aligned}
\tag{8}
$$

which can be solved as an LP problem. The LP problem is solved by using an
active set algorithm. Please notice that an LP solver handles a problem for
a given vector of unknown variables. Here we substitute the vector of
variables in $F^n$ (the original variables of the EMD problem) with a longer
vector, which contains the entries of $F^n$, $v_n$, and $s_n$. We did not
reformulate the EMD objective, but we changed its constraints to take care
of the additional variables of the problem being solved. In this way, the
new LP problem is different from that of the original EMD, and the result
contains the entries of $F^n$, $v_n$, and $s_n$.
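
The stacking of $F^n$, $v_n$, and $s_n$ into one LP variable vector can be sketched as below. This is a hedged illustration, not the authors' implementation: it uses SciPy's `linprog` with the HiGHS solver rather than an active set algorithm. It also assumes that every supply is shipped in full (an equality in place of $\sum_j f^n_{ij} \le x_{ni}$), because with upper bounds only the all-zero solution would be optimal. The function name `sc_emd_code` and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def sc_emd_code(x, U, d, gamma):
    """Solve a variant of problem (8) for one histogram x with fixed U (D x M)."""
    D, M = U.shape
    nF = D * D                                   # z = [vec(F) row-major, v, s]
    cost = np.concatenate([d.ravel(), np.zeros(M), gamma * np.ones(M)])
    # Assumption of this sketch: each supply is shipped in full (equality).
    A_eq = np.zeros((D, nF + 2 * M))
    A_dem = np.zeros((D, nF + 2 * M))
    for i in range(D):
        A_eq[i, i * D:(i + 1) * D] = 1.0         # sum_j f_ij = x_i
        A_dem[i, i:nF:D] = 1.0                   # mass arriving at demand bin i ...
        A_dem[i, nF:nF + M] = -U[i, :]           # ... minus sum_m u_mi v_m <= 0
    I, Z = np.eye(M), np.zeros((M, nF))
    A_ub = np.vstack([A_dem,
                      np.hstack([Z, I, -I]),     #  v - s <= 0
                      np.hstack([Z, -I, -I])])   # -v - s <= 0
    # F >= 0, v free, s >= 0
    bounds = [(0, None)] * nF + [(None, None)] * M + [(0, None)] * M
    res = linprog(cost, A_ub=A_ub, b_ub=np.zeros(D + 2 * M),
                  A_eq=A_eq, b_eq=x, bounds=bounds, method="highs")
    return res.x[nF:nF + M], res.fun

D = 3
d = np.abs(np.subtract.outer(np.arange(D), np.arange(D))).astype(float)
U = np.array([[1.0, 0.0],
              [0.0, 0.5],
              [0.0, 0.5]])                       # columns are u_1 and u_2
v, objective = sc_emd_code(U[:, 0].copy(), U, d, gamma=0.1)
print(v, objective)
```

When the input histogram equals the first basic histogram, the transport cost vanishes and only the L1 term remains, so the code selects that single basic histogram.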

2.2.2 Optimizing U while fixing (V,S)

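By fixing $V$ and $S$, the optimization problem in (7) is reduced to

$$
\begin{aligned}
\min_{U, F^1, \cdots, F^N} \; & \sum_{n=1}^{N} \sum_{i,j} f^n_{ij} d_{ij} \\
\text{s.t.} \; & \sum_{j} u_{mj} = 1, \;\; u_{mj} \ge 0, \\
& f^n_{ij} \ge 0, \;\; \sum_{j} f^n_{ij} \le x_{ni}, \;\;
  \sum_{i} f^n_{ij} \le \sum_{m} u_{mj} v_{nm},
\end{aligned}
\tag{9}
$$

which can also be solved as an LP problem using an active set algorithm.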
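
A possible layout of the LP in (9) is sketched below, again with SciPy's `linprog` (HiGHS) standing in for the active set solver. As an assumption of this example, each supply is shipped in full, so that the all-zero flow is excluded. The function name `update_dictionary` and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def update_dictionary(X, V, d):
    """Solve a variant of problem (9): X is D x N, V is M x N, d is D x D."""
    D, N = X.shape
    M = V.shape[0]
    nU, nF = D * M, D * D                 # U row-major: U[j, m] -> j * M + m
    nvar = nU + N * nF                    # z = [vec(U), vec(F^1), ..., vec(F^N)]
    cost = np.concatenate([np.zeros(nU), np.tile(d.ravel(), N)])
    A_eq = np.zeros((M + N * D, nvar))
    b_eq = np.zeros(M + N * D)
    A_ub = np.zeros((N * D, nvar))
    for m in range(M):
        A_eq[m, m:nU:M] = 1.0             # bins of basic histogram m sum to one
        b_eq[m] = 1.0
    for n in range(N):
        off = nU + n * nF                 # start of vec(F^n)
        for i in range(D):
            # Assumption of this sketch: supply i of sample n is shipped in full.
            A_eq[M + n * D + i, off + i * D:off + (i + 1) * D] = 1.0
            b_eq[M + n * D + i] = X[i, n]
            # Demand bin i: incoming mass - sum_m U[i, m] V[m, n] <= 0
            A_ub[n * D + i, off + i:off + nF:D] = 1.0
            A_ub[n * D + i, i * M:(i + 1) * M] = -V[:, n]
    res = linprog(cost, A_ub=A_ub, b_ub=np.zeros(N * D), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.x[:nU].reshape(D, M), res.fun

D = 3
d = np.abs(np.subtract.outer(np.arange(D), np.arange(D))).astype(float)
X = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])                # two training histograms as columns
V = np.eye(2)                             # each sample coded by one basic histogram
U_new, total_work = update_dictionary(X, V, d)
print(U_new, total_work)
```

With one-hot codes, the cheapest dictionary simply copies each sample's histogram, which gives zero transport cost.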

An important limitation of both the optimization problems in (8) and (9) is
the large number of additional variables in the LP problem. For each sample
$x_n$, a $D \times D$ transported amount matrix $F^n$ is solved in both (8)
and (9), thus there are in total $N \times D \times D$ transported amount
variables in the LP problem for the N training samples. When the dimension
of the histogram D or the number of training samples N is large, the number
of variables becomes large, which can cause a serious computational problem.
To overcome this shortcoming, we reduce the number of variables in $F^n$ by
allowing the earth of the i-th supply to be moved only to its K nearest
demands instead of all the D demands. The K nearest demands of the i-th
supply are found by using the ground distances. In this way, we reduce the
number of transported amount variables for each supply of each sample from D
to K. Usually $K \ll D$, thus the total number of transported amount
variables is reduced significantly from $N \times D \times D$ to
$N \times D \times K$.
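
The variable reduction can be sketched as follows: for each supply bin, only the K demand bins with the smallest ground distance are kept as admissible transport targets. The function name and the toy sizes are illustrative, and ties are broken by the lower bin index, which is a choice of this sketch.

```python
def k_nearest_demands(d, K):
    """For each supply i, return the indices of its K nearest demands.

    d is a D x D list of ground distances; sorted() is stable, so ties keep
    the lower bin index first.
    """
    D = len(d)
    return [sorted(range(D), key=lambda j: d[i][j])[:K] for i in range(D)]

D, K, N = 6, 3, 100
d = [[abs(i - j) for j in range(D)] for i in range(D)]
allowed = k_nearest_demands(d, K)

full_vars = N * D * D                               # one variable per (n, i, j)
reduced_vars = N * sum(len(a) for a in allowed)     # only admissible (i, j) pairs
print(allowed[0], allowed[3], full_vars, reduced_vars)
```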


2.3 Algorithm 

We summarize the iterative learning algorithm of the basic histograms and
coding vectors in Algorithm 1. In each iteration, the sparse coding vector
of each sample is first learned sequentially, and the basic histograms are
then updated based on the learned sparse coding vectors. The iterations are
repeated T times. When a novel sample comes with its histogram in the test
procedure, we simply solve (8) to obtain its sparse coding vector.

Algorithm 1 SC-EMD Algorithm.

Input: Histograms of training samples $X = \{x_1, \cdots, x_N\}$;
Input: Number of basic histograms M;
Input: Maximum number of iterations T.

Initialize the basic histogram matrix $U^0$;
for $t = 1, \cdots, T$ do
  for $n = 1, \cdots, N$ do
    Update the sparse coding vector $v_n^t$ for the n-th sample by solving
    (8) while fixing $U^{t-1}$;
  end for
  Update the basic histogram matrix $U^t$ by solving (9) while fixing $V^t$;
end for

Output: The basic histogram matrix $U^T$ and the sparse coding matrix $V^T$.


2.4 Relation to nonnegative matrix factorization with earth mover’s distance

Sandler and Lindenbaum [60] proposed nonnegative matrix factorization with
earth mover’s distance (NMFEMD), which factorizes a nonnegative matrix by
minimizing the EMD between the original data matrix and the product of two
factorization matrices. Our SC-EMD is closely related to it. We discuss the
relations between the two methods as follows:

- Both SC-EMD and NMFEMD use the earth mover’s distance to measure the
  reconstruction error of the data, which is a suitable distance measure
  for multi-instance quantization histograms.

- NMFEMD does not impose sparsity on the factorization matrices, while
  SC-EMD requires the reconstruction coefficients to be sparse. The
  sparsity of the reconstruction coefficients is measured by an L1 norm.
  Thus the objective function of SC-EMD differs from that of NMFEMD: the
  objective of NMFEMD is only an EMD term, while the objective of SC-EMD is
  composed of an EMD term and an L1 term. The optimization of SC-EMD is more
  difficult than that of NMFEMD due to this additional term.

- NMFEMD constrains the factorization matrices to be nonnegative, while
  SC-EMD does not have such constraints. However, these constraints do not
  change the nature of the optimization. NMFEMD is optimized as a linearly
  constrained LP problem, and the nonnegativity constraints only add some
  more linear constraints to the problem. In contrast, SC-EMD adds an L1
  norm regularization term to the objective, which changes the optimization
  problem.


3 Experiments 

In this section, we evaluate the proposed method on three multi-instance
learning problems, in which the feature vector of each sample is a histogram.


3.1 Experiment I: Abnormal Image Detection in Wireless Capsule Endoscopy 
Videos 

Wireless Capsule Endoscopy (WCE) has been used to detect mucosal
abnormalities in the gastrointestinal tract, including blood, ulcers,
polyps, etc. However, usually only a few frames from a large number of WCE
videos contain abnormalities, thus a clinician spends a long time finding
the abnormal frames in a WCE video. It is therefore necessary to develop a
system that automatically discriminates abnormal frames from normal ones. In
this experiment, we evaluated the proposed method as an image representation
method for the task of abnormal image detection in WCE videos.


3.1.1 Dataset and Setup 

We constructed the data set for the experiment by collecting 170 images from
WCE videos, belonging to three abnormal classes and one normal class. The
data set contains 50 normal images, 40 polyp images, 40 ulcer images, and 40
blood images. Given an image from a WCE video, the task of abnormal image
detection is to classify it into one of the four classes. To this end, each
image was split into many small 8 × 8 patches, and each patch was treated as
an instance, thus the image was represented as a bag of instances under the
framework of multi-instance learning. We extracted color and texture
features from each patch and concatenated them as the visual features of
each instance. Then the instances were quantized into a pool of instance
prototypes, and the quantization histogram was normalized and used as the
feature vector of the image. The histograms were further represented as
sparse coding vectors using the proposed SC-EMD algorithm, and the coding
vectors were used to train a Support Vector Machine (SVM) to classify the
images into one of the four image types.

To conduct the experiment, we employed the 10-fold cross-validation
protocol. The entire data set was randomly split into 10 non-overlapping
folds. In each fold, there were 5 normal images, 4 polyp images, 4 ulcer
images, and 4 blood images. Each fold was used as the test set in turn, and
the remaining 9 folds were combined and used as the training set. After the
images in the training set were represented as histograms under the
multi-instance learning framework, we applied the SC-EMD algorithm to them
and obtained the basic histograms and the sparse coding vectors of the
training images. Then we trained SVM classifiers on these sparse coding
vectors. To handle the multi-class problem, we used the one-against-all
protocol: an SVM classifier was trained for each class, using the images of
this class as positive samples and all other images as negative samples.
Based on the basic histograms learned from the training histograms, we
represented the test images as sparse coding vectors, and finally input
them into the learned SVM classifiers to obtain the final classification
results. Please note that the parameters were tuned using only the training
set, excluding the test set.
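
The fold construction described above can be sketched as a class-stratified split. The class sizes come from the text, while the function name, the seed, and the use of (label, index) pairs as image identifiers are assumptions made for illustration.

```python
import random

def stratified_folds(class_counts, n_folds=10, seed=0):
    """Split samples into folds so that every fold has the same class mix."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label, count in class_counts.items():
        ids = [(label, k) for k in range(count)]   # hypothetical image ids
        rng.shuffle(ids)
        for pos, item in enumerate(ids):
            folds[pos % n_folds].append(item)      # deal out round-robin
    return folds

counts = {"normal": 50, "polyp": 40, "ulcer": 40, "blood": 40}
folds = stratified_folds(counts)

# Use fold 0 as the test set and the remaining nine folds for training.
test_fold = folds[0]
train = [item for f in folds[1:] for item in f]
print(len(test_fold), len(train))
```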

The classification results are measured by the recall-precision curve, the
Receiver Operating Characteristic (ROC) curve, and the Area Under the ROC
Curve (AUC) value for each class.
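
For reference, the AUC of one class can be computed directly from classifier scores, using the fact that it equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The sketch below is a generic illustration with made-up scores, not the evaluation code used in the experiments.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical one-against-all scores for four test images.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
value = auc(scores, labels)
print(value)
```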
3.1.2 Results

In the experiments, we compared our SC-EMD algorithm as a data
representation method against the traditional sparse coding method using the
L2 norm distance (denoted as SC-L2 norm), and also against the original
histogram as representation (denoted as Histogram). The recall-precision
curves for the four classes are given in Fig. 1(a)-(d). These figures show
that with the proposed SC-EMD, the classification performances for all four
classes are improved significantly, even more so for the last three classes.
The performance improvement is particularly dramatic for the Polyp and
Normal classes. SC-L2 norm improves on the original histogram features
somewhat; however, because it employs the L2 norm distance as the loss
function, which is not suitable for histogram data, the improvement is
limited. In particular, Fig. 1(a) shows that an increase in classification
performance is obtained by SC-L2 norm against both original histogram and
SC-EMD. The results validate the importance of performing sparse coding with
an appropriate loss function for histogram data.

The ROC curves of the different classes are shown in Fig. 2. Moreover, the
AUC values are given in Fig. 3. As shown in Fig. 2 and Fig. 3, our SC-EMD
algorithm again clearly outperforms the original histogram feature and the
SC-L2 norm-based method in all four classes. The advantage is particularly
significant on the more challenging normal class. This result highlights the
importance of using the EMD measure for histograms rather than the L2 norm
distance.


3.2 Experiment II: Protein binding site retrieval 

Searching for geometrically similar protein binding sites is important for
understanding the functions of proteins and for drug discovery. Pang et al.
represented protein binding sites as histograms using the multi-instance
learning framework for the protein binding site retrieval problem. In this
experiment, we evaluated the proposed algorithm for the representation of
histograms of protein binding sites.


3.2.1 Dataset and Setup 

In this experiment, we used a protein binding site data set reported by Pang
et al. This non-redundant data set contains 2,819 protein binding sites in
total, belonging to 501 different classes. The number of sites in each class
varies from 2 to 58. To conduct the 4-fold cross-validation, we randomly
selected 2,226 binding sites to construct our data set. The selected data
set contained sites of 249 classes, and the number of sites in each class
ranged from 4 to 58, so that when the 4-fold cross-validation was performed,
there was at least one site from every class in each fold. The numbers of
sites of all the selected classes are shown in Fig. 4.

Fig. 1 The recall-precision curves of different classes using different
histogram representation methods on the WCE images database: (a) Blood,
(b) Polyp, (c) Ulcer, (d) Normal.


Given a query binding site and a protein binding site database, the protein
binding site retrieval problem is to rank the database sites in descending
order of their similarity to the query, so that the database sites belonging
to the same class as the query are ranked at the top positions of the
returned list. To this end, we first represented each binding site as a bag
of feature points selected from the binding site surface, and extracted the
geometric features of each point. In the multi-instance learning framework,
a binding site was referred to as a bag, and each feature point was referred
to as an instance. Then all the feature points were quantized into a set of
prototype points, and a histogram was generated as the bag-level feature of
the binding site [53]. Using the proposed SC-EMD algorithm, the histograms
were represented as sparse codes for the final ranking. The ranking
performances were evaluated by the recall-precision and the ROC curves. The
AUC values of the ROC curves were also reported as a single performance
measure of the ranking results.

Fig. 2 The ROC curves of different classes using different histogram
representation methods on the WCE images database: (a) Blood, (b) Polyp,
(c) Ulcer, (d) Normal.

Fig. 3 The AUC values of different classes (Blood, Polyp, Ulcer, Normal)
using different histogram representation methods on the WCE images database.

Fig. 4 The numbers of sites in different classes of the protein binding site
dataset.


3.2.2 Results 

The recall-precision and ROC curves of different histogram representation
methods are given in Fig. 5. From these results, we can see that the SC-EMD
method performs the best in terms of both the recall-precision and ROC
curves. This suggests that EMD is a more suitable distance measure than the
L2 norm distance for histogram comparison and coding. The AUC values of the
ROC curves are also given in Fig. 6. The protein binding site retrieval
system with the SC-EMD representation method achieves an AUC value of
0.9466, compared to an AUC value of 0.9282 using SC-L2 norm and 0.9114 using
the original histogram.


Fig. 5 The recall-precision and ROC curves of different histogram
representation methods on the protein binding site database:
(a) Recall-Precision, (b) ROC.

Fig. 6 The AUC values of ROC curves of different histogram representation
methods (SC-EMD, SC-L2 norm, Histogram) on the protein binding site
database.


3.3 Experiment III: Object Recognition 

In this section, we compared the proposed method to some other feature
extraction and classification methods on a publicly accessible image
database.


3.3.1 Dataset and Protocol 

In this experiment, we used the COREL-2000 image database, which is popular
in the computer vision community for the problem of object recognition [7].
This data set contains 2,000 images of 20 objects, with 100 images for each
object. The target of image recognition was to assign a given image to the
correct object class.

To conduct the experiment, we also used the 10-fold cross-validation. Each
image was regarded as a sample, and we extracted the Regions Of Interest
(ROI) as the instances. We extracted multiple ROIs for each image, thus each
image was represented as a bag of multiple instances. Moreover, the
instances of the training images were clustered to generate a set of
instance prototypes using a clustering algorithm, and then the instances of
each image were quantized into these prototypes to represent each image as a
histogram. The histograms were first normalized, and then the proposed
SC-EMD algorithm was applied to represent the histograms as sparse codes.
The histograms of the training samples were used to learn the basic
histograms and their sparse codes first, and then an SVM classifier was
learned in the sparse code space for each class. To classify a test sample,
we also represented its histogram as a sparse code vector using the basic
histograms learned from the training set, and then classified the sparse
code vector using the SVM classifiers learned from the training set. To
evaluate the classification performance of the proposed algorithm, we used
the classification accuracy as the performance measure, which is computed
as

$$
\text{accuracy} = \frac{\text{Number of correctly classified test images}}
                       {\text{Number of test images}}.
\tag{10}
$$

3.3.2 Results

In this experiment, we compared the proposed histogram representation
algorithm against several visual feature extraction methods, including the
Learning Locality-Constrained Collaborative Representation (LCCR) method
proposed by Peng et al., SC-L2, Histogram of Oriented Gradients (HOG) [13],
and Scale-Invariant Feature Transform (SIFT) [45] [63] [9]. The experiment
results are given in Fig. 7(a). As we can see from this figure, the proposed
method significantly outperforms all the other methods except LCCR. SIFT and
HOG both represent the images as histograms; however, they ignore the
structure of the data set by representing each image individually, thus
their performances are inferior to the others. SC-L2 exploits the training
set by learning a set of basic histograms to represent all the image
histograms, and it achieves some minor improvements. But it uses the L2 norm
distance to compare the histograms, which is not suitable. It is interesting
to notice that LCCR, which improves the robustness and discrimination of
data representation by introducing local consistency, achieves performance
similar to SC-EMD and also outperforms the other methods. Although it also
uses the L2 norm as the loss function, which is unsuitable for histograms,
it considers the local consistency of the data samples and seeks the
smoothness of the code instead of the sparsity. These are the main reasons
for the good performance of LCCR. This also encourages us to develop novel
methods that combine EMD with locality-constrained collaborative
representation, which may improve the performance even further. Moreover, we
also compared our method to two popular classification methods, Sparse
Representation-based Classification (SRC) and Nearest Neighbor
classification (NN). The results are given in Fig. 7(b). As we can see from
the figure, the proposed algorithm based on EMD outperforms both
classification methods, which use the L2 norm distance as the distance
metric for histograms. This is further evidence of the importance of EMD for
histogram data analysis and classification.


4 Conclusion and Future Works 

A new type of sparse coding method, sparse coding with the EMD metric, is
proposed in this paper for the representation of histogram data. The
objective function is composed of an EMD term between the original histogram
and its reconstruction from a pool of basic histograms, and an L1 term for
the regularization of the coding vector. The optimization problem is solved
as an LP problem in an iterative algorithm. Algorithms based on the proposed
SC-EMD outperformed previous L2 norm-based sparse coding algorithms in three
challenging multi-instance learning tasks.

Fig. 7 Experiment results on the COREL-2000 image data set: (a) comparison
to feature learning methods, (b) comparison to classifiers.